Abstract

It is proved that the limiting distribution of the length of the longest weakly increasing
subsequence in an inhomogeneous random word is related to the distribution function for 
the eigenvalues of a certain direct sum of Gaussian unitary ensembles subject to an overall 
constraint that the eigenvalues lie in a hyperplane. 

1 Introduction 

A class of problems — important for their applications to computer science and computational 
biology as well as for their inherent mathematical interest — is the statistical analysis of a 
string of random symbols. The symbols, called letters, are assumed to belong to an alphabet 
A of fixed size k. The set of all such strings (or words) of length N, W(A, N), forms the
sample space in the statistical analysis of these strings. A natural measure on W is to assign 
each letter equal probability, i.e. 1/k, and to define the probability measure on words by the 
product measure. Thus each letter in a word occurs independently and with equal probability. 
We call such random word models homogeneous. 

Of course, in some applications the letters of the alphabet do not all occur with the same
frequency, and it is therefore natural to assign to each letter i a probability $p_i$. If we again use
the product measure for the words (letters in a word occur independently), then the resulting
random word models are called inhomogeneous.

Fixing an ordering of the alphabet A, a weakly increasing subsequence of a word

\[ w = \alpha_1 \alpha_2 \cdots \alpha_N \in W \]

is a subsequence $\alpha_{i_1}\alpha_{i_2}\cdots\alpha_{i_m}$ such that $i_1 < i_2 < \cdots < i_m$ and
$\alpha_{i_1} \le \alpha_{i_2} \le \cdots \le \alpha_{i_m}$. The
positive integer m is called the length of this weakly increasing subsequence. For each word
$w \in W$ we define $\ell_N(w)$ to equal the length of the longest weakly increasing subsequence in
w. We now define the fundamental object of this paper:

\[ F_N(n) := \mathrm{Prob}\left(\ell_N(w) \le n\right) \]

where Prob is the inhomogeneous measure on random words. Of course, Prob depends upon
N and the probabilities $p_i$.
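The definition of $\ell_N(w)$ and of $F_N(n)$ can be made concrete with a short sketch. The Python
below is illustrative only and is not part of the paper: the names `longest_weakly_increasing`
and `distribution_function` are assumptions chosen here, letters are encoded as the integers
1, ..., k, and $F_N(n)$ is evaluated by brute-force enumeration of all $k^N$ words, which is feasible
only for small N.

```python
from bisect import bisect_right
from itertools import product
from math import prod


def longest_weakly_increasing(word):
    """Length of the longest weakly increasing subsequence of a sequence of letters."""
    # tails[m] holds the smallest possible last letter of a weakly
    # increasing subsequence of length m + 1 found so far.
    tails = []
    for letter in word:
        # bisect_right lets equal letters extend a subsequence (weak increase).
        pos = bisect_right(tails, letter)
        if pos == len(tails):
            tails.append(letter)
        else:
            tails[pos] = letter
    return len(tails)


def distribution_function(p, N, n):
    """F_N(n) = Prob(l_N(w) <= n), summing the product measure over all k**N words."""
    k = len(p)
    total = 0.0
    for word in product(range(1, k + 1), repeat=N):
        if longest_weakly_increasing(word) <= n:
            total += prod(p[a - 1] for a in word)  # independence of the letters
    return total
```

For instance, the word 2 1 1 3 2 2 contains the weakly increasing subsequence 1 1 2 2 and no
longer one, so `longest_weakly_increasing([2, 1, 1, 3, 2, 2])` returns 4.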

Our results are of two types. To state our first results, we order the $p_i$ so that

\[ p_1 \ge p_2 \ge \cdots \ge p_k \]

and decompose our alphabet A into subsets $A_1, A_2, \ldots$ such that $p_i = p_j$ if and only if i and
j belong to the same $A_\alpha$. Setting $k_\alpha = |A_\alpha|$, we show that the limiting distribution function
as $N \to \infty$ for the appropriately centered and normalized random variable $\ell_N$ is related to
the distribution function for the eigenvalues $\xi_i$ in the direct sum of mutually independent
$k_\alpha \times k_\alpha$ Gaussian unitary ensembles (GUE),¹ conditional on the eigenvalues $\xi_i$ satisfying
$\sum \sqrt{p_i}\,\xi_i = 0$. In the case when one letter occurs with greater probability than the others,
this result implies that the limiting distribution of $(\ell_N - N p_1)/\sqrt{N}$ is Gaussian with variance
equal to $p_1(1 - p_1)$. In the case when all the probabilities $p_i$ are distinct, we compute the
next correction in the asymptotic expansion of the mean of $\ell_N$ and find that

\[ E(\ell_N) = N p_1 + \sum_{j>1} \frac{p_j}{p_1 - p_j} + O\Bigl(\frac{1}{\sqrt{N}}\Bigr), \qquad N \to \infty. \]

This last formula agrees quite well with finite N simulations. We expect this asymptotic
formula remains valid when one letter occurs with greater probability than the others.

These results generalize work on the homogeneous model by Johansson [11] and by Tracy
and Widom [17]. Since all the probabilities $p_i$ are equal in the homogeneous model, the
underlying random matrix model is k × k traceless GUE. That is, the direct sum reduces to
just one term. In [17] the integrable system underlying the finite N homogeneous model was
shown to be related to Painlevé V. In the isomonodromy formulation of Painlevé V the
associated 2 × 2 matrix linear ODE has two simple poles in the finite complex plane and one
Poincaré index 1 irregular singular point at infinity. In Part II we will show that the finite N
inhomogeneous model is represented by the isomonodromy deformations of the 2 × 2 matrix
linear ODE which has m + 1 simple poles in the finite complex plane and, again, one Poincaré
index 1 irregular singular point at infinity. The number m is the total number of the subsets
$A_\alpha$, and the poles are located at zero and at the points $-p_{i_\alpha}$ ($i_\alpha = \max A_\alpha$). The
integers $k_\alpha$ appear as the formal monodromy exponents at the respective points $-p_{i_\alpha}$. We
will also analyse the monodromy meaning of the asymptotic results obtained in this part.

¹ A basic reference for random matrices is Mehta's book [12].

The results presented here are part of the recent flurry of activity centering around connec- 
tions between combinatorial probability of the Robinson-Schensted-Knuth (RSK) type on the 
one hand and random matrices and integrable systems on the other. From the point of view 
of probability theory, the quite surprising feature of these developments is that the methods 
came from Toeplitz determinants, integrable differential equations of the Painleve type and 
the closely related Riemann-Hilbert techniques. The first to discover this connection at the
level of distribution functions was Baik, Deift and Johansson |Q] who showed that the limit- 
ing distribution of the length of the longest increasing subsequence in a random permutation 
is equal to the limiting distribution function of the appropriately centered and normalized 
largest eigenvalue in the GUE [15]. This result has been followed by a number of developments
relating random permutations, random words and more generally random Young
tableaux to the distribution functions of random matrix theory |2|, 0, 0, R, 11, 13, 16]. 



2 Random Words 

2.1 Probability Measure on Words and Partitions 

The Robinson-Schensted-Knuth (RSK) algorithm is a bijection between two-line arrays $w_A$
(or generalized permutation matrices) and ordered pairs (P, Q) of semistandard Young tableaux
(SSYT).² When the two-line arrays have the special form

\[ w_A = \begin{pmatrix} 1 & 2 & \cdots & N \\ \alpha_1 & \alpha_2 & \cdots & \alpha_N \end{pmatrix}, \]

$\alpha_i \in A = \{1, 2, \ldots, k\}$, we identify each $w_A$ with a word $w = \alpha_1\alpha_2\cdots\alpha_N$ of length N
composed of letters from the alphabet A; furthermore, in this case the insertion tableaux P
have shape $\lambda \vdash N$, $\ell(\lambda) \le k$, with entries coming from A and the recording tableaux Q are
standard Young tableaux (SYT) of the same shape $\lambda$. As usual, $f^\lambda$ denotes the number of
SYT of shape $\lambda$ and $d_\lambda(k)$ the number of SSYT of shape $\lambda$ whose entries come from A.

We define a probability measure, Prob, on W(A, N), the set of all words w of length N
formed from the alphabet A, by the two requirements:

1. For each word w consisting of a single letter $i \in A$, $\mathrm{Prob}(w = i) = p_i$, $0 < p_i < 1$, with
   $\sum_{i=1}^{k} p_i = 1$.

2. For each $w = \alpha_1\alpha_2\cdots\alpha_N \in W$ and any $i_j \in A$, $j = 1, 2, \ldots, N$,

\[ \mathrm{Prob}\left(\alpha_1\alpha_2\cdots\alpha_N = i_1 i_2 \cdots i_N\right) = \prod_{j=1}^{N} \mathrm{Prob}\left(\alpha_j = i_j\right) \quad \text{(independence)}. \]

Of course, Prob depends both on N and the probabilities $\{p_i\}$.

² For a detailed account of the RSK algorithm see Stanley, Chp. 7 [14]. We use without further reference
various results from symmetric function theory all of which can be found in Stanley.

Under the RSK correspondence, the probability measure Prob induces a probability measure
on partitions $\lambda \vdash N$, which we will again denote by Prob. This induced measure is expressed
in terms of $f^\lambda$ and the Schur function. To see this we first recall that a tableau T has type
$\alpha = (\alpha_1, \alpha_2, \ldots)$, denoted $\alpha = \mathrm{type}(T)$, if T has $\alpha_i = \alpha_i(T)$ parts equal to i. We write

\[ x^T = x_1^{\alpha_1(T)} x_2^{\alpha_2(T)} \cdots . \]

The combinatorial definition of the Schur function of shape $\lambda$ in the variables $x = (x_1, x_2, \ldots)$
is the formal power series

\[ s_\lambda(x) = \sum_T x^T \]

summed over all SSYT of shape $\lambda$. The $p = \{p_1, \ldots, p_k\}$ specialization of $s_\lambda(x)$ is
$s_\lambda(p) = s_\lambda(p_1, p_2, \ldots, p_k, 0, 0, \ldots)$.

For each word $w \leftrightarrow (P, Q)$, the entries of P consist of the letters of w since P is formed
by successive row bumping the letters from w. Because of the independence assumption,

\[ p^P = p_1^{\alpha_1(P)} p_2^{\alpha_2(P)} \cdots p_k^{\alpha_k(P)} \]

gives the weight assigned to word w. From the combinatorial definition of the Schur function,
we observe that its p specialization is summing the weights of words w that under RSK have
shape $\lambda \vdash N$. The recording tableau Q keeps track of the order of the letters in the word. The
weights of any words with the same number of letters of each type are equal (independence),
so we need merely count the number of such Q, i.e. $f^\lambda$, and multiply this by the weight of
any given such word to arrive at the induced measure on partitions,

\[ \mathrm{Prob}(\lambda) = s_\lambda(p)\, f^\lambda \tag{2.1} \]

which satisfies the normalization $\sum_{\lambda \vdash N} \mathrm{Prob}(\lambda) = 1$. For the homogeneous case $p_i = 1/k$,
the measure reduces to

\[ \mathrm{Prob}(\lambda) = s_\lambda(1/k, 1/k, \ldots, 1/k)\, f^\lambda = \frac{d_\lambda(k)\, f^\lambda}{k^N}. \]

The Poissonization of this homogeneous measure is called the Charlier ensemble in [11].
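The induced measure (2.1) can be checked numerically on small cases. The sketch below is an
illustration under stated assumptions, not the authors' code: all function names are hypothetical,
the $p_i$ are taken distinct so that $s_\lambda(p)$ can be evaluated by the classical ratio of determinants
$\det(p_i^{h_j})/\Delta(p)$ with $h_j = \lambda_j + k - j$, and $f^\lambda$ is computed from $N!\,\Delta(h)/(h_1!\cdots h_k!)$. Exact
rational arithmetic is used so the comparison with brute-force enumeration is an equality.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import product
from math import factorial, prod


def rsk_shape(word):
    """Shape of the RSK insertion tableau of a word, built by row bumping."""
    rows = []
    for letter in word:
        for row in rows:
            pos = bisect_right(row, letter)      # first entry strictly larger
            if pos == len(row):
                row.append(letter)
                break
            row[pos], letter = letter, row[pos]  # bump the displaced entry down
        else:
            rows.append([letter])
    return tuple(len(row) for row in rows)


def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if not m:
        return Fraction(1)
    return sum((-1) ** c * m[0][c] * det([r[:c] + r[c + 1:] for r in m[1:]])
               for c in range(len(m)))


def vandermonde(v):
    """Product of v_i - v_j over i < j."""
    return prod(v[i] - v[j] for i in range(len(v)) for j in range(i + 1, len(v)))


def schur_times_f(shape, p):
    """s_lambda(p) * f^lambda for distinct p_i."""
    k = len(p)
    lam = list(shape) + [0] * (k - len(shape))
    h = [lam[i] + k - (i + 1) for i in range(k)]
    s = det([[p[i] ** h[j] for j in range(k)] for i in range(k)]) / vandermonde(p)
    f = Fraction(factorial(sum(lam)) * vandermonde(h), prod(factorial(x) for x in h))
    return s * f


def induced_measure(p, N):
    """Prob(lambda), obtained by pushing the product measure on words through RSK."""
    measure = {}
    for word in product(range(len(p)), repeat=N):
        shape = rsk_shape(word)
        measure[shape] = measure.get(shape, 0) + prod(p[a] for a in word)
    return measure
```

With `p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]` and N = 4, the values of
`induced_measure(p, 4)` sum to 1 and agree shape by shape with `schur_times_f`.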
If $\ell_N(w)$ equals the length of the longest weakly increasing subsequence in the word
$w \in W(A, N)$, then by the RSK correspondence $w \leftrightarrow (P, Q)$, the number of boxes in the first row
of P, $\lambda_1$, equals $\ell_N(w)$. Hence,

\[ \mathrm{Prob}\left(\ell_N(w) \le n\right) = \sum_{\substack{\lambda \vdash N \\ \lambda_1 \le n}} s_\lambda(p)\, f^\lambda. \tag{2.2} \]
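The fact used here, that the first row of the insertion tableau P has exactly $\ell_N(w)$ boxes, can
be verified exhaustively for short words. The following Python is a sketch written for this
purpose, not taken from the paper; `insertion_tableau` and `longest_weakly_increasing` are
hypothetical names, and letters are encoded as integers.

```python
from bisect import bisect_right
from itertools import product


def insertion_tableau(word):
    """RSK insertion tableau P of a word, as a list of weakly increasing rows."""
    rows = []
    for letter in word:
        for row in rows:
            pos = bisect_right(row, letter)      # leftmost entry strictly larger
            if pos == len(row):
                row.append(letter)               # letter fits at the end of the row
                break
            row[pos], letter = letter, row[pos]  # bump that entry into the next row
        else:
            rows.append([letter])                # start a new row at the bottom
    return rows


def longest_weakly_increasing(word):
    """Length of the longest weakly increasing subsequence (patience sorting)."""
    tails = []
    for letter in word:
        pos = bisect_right(tails, letter)
        if pos == len(tails):
            tails.append(letter)
        else:
            tails[pos] = letter
    return len(tails)


def first_row_matches(k, max_length):
    """Check lambda_1 = l_N(w) for every word of length <= max_length over k letters."""
    for N in range(max_length + 1):
        for word in product(range(1, k + 1), repeat=N):
            rows = insertion_tableau(word)
            first_row = len(rows[0]) if rows else 0
            if first_row != longest_weakly_increasing(word):
                return False
    return True
```

For the word 2 1 1 3 2 2 the insertion tableau has rows 1 1 2 2 and 2 3, so $\lambda_1 = 4 = \ell_N(w)$.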



2.2 Toeplitz Determinant Representation 

Gessel's theorem Q is the formal power series identity^ 

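\[ \sum_{\lambda:\ \lambda_1 \le n} s_\lambda(x)\, s_\lambda(y) = \det\left(T_n(\varphi)\right) \]

where $T_n(\varphi)$ is the n × n Toeplitz matrix whose i, j entry is $\varphi_{i-j}$, where $\varphi_i$ is the ith Fourier
coefficient of

\[ \varphi(z) = \prod_{n=1}^{\infty} \left(1 + y_n z^{-1}\right) \prod_{n=1}^{\infty} \left(1 + x_n z\right). \]

If we define the (exponential) generating function

\[ G_I(n; \{p_i\}, t) = \sum_{N=0}^{\infty} \mathrm{Prob}\left(\ell_N(w) \le n\right) \frac{t^N}{N!}, \]

then an immediate consequence of Gessel's identity with p specialization of the x variables
and exponential specialization of the y variables and the RSK correspondence is

\[ G_I(n; \{p_i\}, t) = \det\left(T_n(f_I)\right) \tag{2.3} \]

where

\[ f_I(z) = e^{t/z} \prod_{j=1}^{k} \left(1 + p_j z\right). \tag{2.4} \]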
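The Toeplitz representation (2.3) with symbol (2.4) can be tested on small cases. The sketch
below is an illustration, not the authors' computation: the function names are hypothetical, the
Fourier coefficients of $e^{t/z}\prod_j(1 + p_j z)$ are obtained by multiplying the two series directly,
and the generating function is summed by brute force. The sum is finite because a word longer
than nk must repeat some letter more than n times, which forces $\ell_N > n$.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import combinations, product
from math import factorial, prod


def longest_weakly_increasing(word):
    tails = []
    for letter in word:
        pos = bisect_right(tails, letter)
        if pos == len(tails):
            tails.append(letter)
        else:
            tails[pos] = letter
    return len(tails)


def symbol_coefficient(i, p, t):
    """Coefficient of z**i in exp(t/z) * prod_j (1 + p_j z)."""
    total = Fraction(0)
    for r in range(len(p) + 1):   # z**r from the product, weight e_r(p)
        m = r - i                 # z**(-m) from exp(t/z), weight t**m / m!
        if m >= 0:
            e_r = sum(prod(c) for c in combinations(p, r))
            total += e_r * t ** m / factorial(m)
    return total


def det(m):
    """Determinant by Laplace expansion along the first row (small n only)."""
    if not m:
        return Fraction(1)
    return sum((-1) ** c * m[0][c] * det([r[:c] + r[c + 1:] for r in m[1:]])
               for c in range(len(m)))


def toeplitz_determinant(n, p, t):
    """det of the n x n Toeplitz matrix with (i, j) entry equal to coefficient i - j."""
    return det([[symbol_coefficient(i - j, p, t) for j in range(n)] for i in range(n)])


def generating_function(n, p, t):
    """Sum over N of Prob(l_N <= n) * t**N / N!; terms with N > n*k vanish."""
    k = len(p)
    total = Fraction(0)
    for N in range(n * k + 1):
        prob = sum(prod(p[a] for a in word)
                   for word in product(range(k), repeat=N)
                   if longest_weakly_increasing(word) <= n)
        total += prob * t ** N / factorial(N)
    return total
```

Both sides are polynomials in t here, so evaluating them at one rational t with exact
arithmetic gives a meaningful, though not exhaustive, check.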



3 Limiting Distribution 



We start with the probability distribution (2.1) on the set of partitions
$\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_k\} \vdash N$. For $f^\lambda$ we use the formula

\[ f^\lambda = \frac{N!\, \Delta(h)}{h_1!\, h_2! \cdots h_k!} \]

where

\[ h_i = \lambda_i + k - i \]

and

\[ \Delta(h) = \Delta(h_1, h_2, \ldots, h_k) = \prod_{1 \le i < j \le k} (h_i - h_j). \tag{3.1} \]

Equivalently,

\[ f^\lambda = \frac{\Delta(h)}{\prod_{i=1}^{k-1} \prod_{j=i}^{k-1} (\lambda_i + k - j)} \binom{N}{\lambda_1\ \lambda_2\ \cdots\ \lambda_k}. \]

³ Precisely, we use the dual version of Gessel's Theorem, see §11 in [17] whose notation we follow.
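The two expressions for $f^\lambda$ can be compared on small partitions. This Python sketch is only
a sanity check written for this purpose; the function names are hypothetical, and a partition
with fewer than k parts is padded with zeros, as the formula with $h_i = \lambda_i + k - i$ requires.

```python
from fractions import Fraction
from math import factorial, prod


def h_values(shape, k):
    """Pad the partition to k parts and return (lambda, h) with h_i = lambda_i + k - i."""
    lam = list(shape) + [0] * (k - len(shape))
    return lam, [lam[i] + k - (i + 1) for i in range(k)]


def delta(h):
    """Vandermonde product of h_i - h_j over i < j."""
    return prod(h[i] - h[j] for i in range(len(h)) for j in range(i + 1, len(h)))


def f_factorial_form(shape, k):
    """f^lambda = N! * Delta(h) / (h_1! ... h_k!)."""
    lam, h = h_values(shape, k)
    return Fraction(factorial(sum(lam)) * delta(h), prod(factorial(x) for x in h))


def f_multinomial_form(shape, k):
    """f^lambda = Delta(h) / (double product) * multinomial coefficient."""
    lam, h = h_values(shape, k)
    multinomial = Fraction(factorial(sum(lam)), prod(factorial(x) for x in lam))
    # factors lambda_i + k - j for 1 <= i <= j <= k - 1 (1-based indices)
    double_product = prod(lam[i - 1] + k - j for i in range(1, k) for j in range(i, k))
    return delta(h) * multinomial / double_product
```

For example, the shape (3, 2) has five standard Young tableaux, and both forms return 5
whether the partition is treated with k = 2 or padded to k = 3.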

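The (classical) definition of the Schur function is

\[ s_\lambda(p) = \frac{\det\left(p_i^{h_j}\right)}{\Delta(p)} = \frac{1}{\Delta(p)} \sum_{\sigma \in S_k} (-1)^{\sigma}\, p_1^{h_{\sigma(1)}} p_2^{h_{\sigma(2)}} \cdots p_k^{h_{\sigma(k)}}. \tag{3.2} \]

This holds when all the $p_i$ are distinct but in general the two determinants require modification,
which we now describe. We order the $p_i$ so that

\[ p_1 \ge p_2 \ge \cdots \ge p_k \tag{3.3} \]

and decompose our alphabet $A = \{1, 2, \ldots, k\}$ into subsets $A_1, A_2, \ldots$ such that $p_i = p_j$ if
and only if i and j belong to the same $A_\alpha$. Set $i_\alpha = \max A_\alpha$. Think of the $p_i$ as indeterminates
and for all indices i differentiate the determinant $i_\alpha - i$ times with respect to $p_i$ if $i \in A_\alpha$.
Then replace the $p_i$ by their given values. (That this is correct follows from l'Hôpital's rule.)
If we set $k_\alpha = |A_\alpha|$ and write $p_\alpha$ for $p_{i_\alpha}$ then we see that $\Delta(p)$ becomes

\[ \Delta'(p) = \prod_{\alpha} \left(1!\, 2! \cdots (k_\alpha - 1)!\right) \prod_{\alpha < \beta} (p_\alpha - p_\beta)^{k_\alpha k_\beta} \tag{3.4} \]

and (after performing row operations) that the ith row of $\det(p_i^{h_j})$ becomes
$\bigl(h_j^{\,i_\alpha - i}\, p_\alpha^{\,h_j - i_\alpha + i}\bigr)$.
Equivalently, the partial product $\prod_{i \in A_\alpha} p_i^{h_{\sigma(i)}}$ from the summand in (3.2) gets multiplied by

\[ \prod_{i \in A_\alpha} \left(h_{\sigma(i)}^{\,i_\alpha - i}\, p_\alpha^{-(i_\alpha - i)}\right) = \Bigl[\prod_{i \in A_\alpha} h_{\sigma(i)}^{\,i_\alpha - i}\Bigr]\, p_\alpha^{-k_\alpha(k_\alpha - 1)/2}. \tag{3.5} \]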

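In the case of distinct $p_i$ we write our formula as

\[ \mathrm{Prob}(\lambda) = s_\lambda(p_1, \ldots, p_k)\, f^\lambda
   = \frac{1}{\Delta(p)}\, \frac{\Delta(h)}{\prod_{i=1}^{k-1} \prod_{j=i}^{k-1} (\lambda_i + k - j)}
     \binom{N}{\lambda_1\ \cdots\ \lambda_k} \sum_{\sigma \in S_k} (-1)^{\sigma}\, p_1^{h_{\sigma(1)}} \cdots p_k^{h_{\sigma(k)}}. \]

Let $M_q(\lambda)$ denote the multinomial distribution associated with a sequence $q = \{q_1, \ldots, q_k\}$,

\[ M_q(\lambda) = \binom{N}{\lambda_1\ \cdots\ \lambda_k}\, q_1^{\lambda_1} \cdots q_k^{\lambda_k}. \]

If $p_\sigma$ denotes the sequence $\{p_{\sigma^{-1}(1)}, \ldots, p_{\sigma^{-1}(k)}\}$, then the above may be written

\[ \mathrm{Prob}(\lambda) = \frac{1}{\Delta(p)}\, \frac{\Delta(h)}{\prod_{i=1}^{k-1} \prod_{j=i}^{k-1} (\lambda_i + k - j)}
   \sum_{\sigma \in S_k} (-1)^{\sigma} \prod_{i=1}^{k} p_i^{\,k - \sigma(i)}\; M_{p_\sigma}(\lambda). \tag{3.6} \]

This is the formula for distinct $p_i$. In the general case we must replace $\Delta(p)$ by $\Delta'(p)$ and
each partial product $\prod_{i \in A_\alpha} p_i^{\,k - \sigma(i)}$ appearing in the sum on the right must be multiplied by
the factor (3.5).

The multinomial distribution $M_q(\lambda)$ has the property that the total measure of any region
where $|\lambda_i - N q_i| > \varepsilon N$ for some i and some $\varepsilon > 0$ tends exponentially to zero as $N \to \infty$.
All the other terms appearing in (3.6) or its modification are uniformly bounded by a power
of N. Since $\lambda_{i+1} \le \lambda_i$ for all i it follows that the contribution of the terms involving $M_q(\lambda)$
in (3.6) will tend exponentially to zero unless $q_{i+1} \le q_i$ for all i. Since $q_i = p_{\sigma^{-1}(i)}$ this
shows that the contribution to (3.6) of the summand corresponding to $\sigma$ is exponentially
small unless $\sigma$ leaves each of the sets $A_\alpha$ invariant. It follows that if we denote the set of such
permutations by $S'_k$ then we may restrict the sum in (3.6) to the $\sigma \in S'_k$ without affecting
the limit. Observe that when $\sigma \in S'_k$ all the $M_{p_\sigma}(\lambda)$ appearing in (3.6) equal $M_p(\lambda)$.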

Write

\[ \lambda_i = N p_i + \sqrt{N p_i}\, \xi_i. \]

In terms of the $\xi_i$ the multinomial distribution $M_p(\lambda)$ converges to

\[ (2\pi)^{-(k-1)/2}\, e^{-\sum \xi_i^2/2}\, \delta\Bigl(\sum \sqrt{p_i}\, \xi_i\Bigr). \tag{3.7} \]

(See section 3.1.) Here $\delta(\sum \sqrt{p_i}\, \xi_i)$ denotes Lebesgue measure on the hyperplane
$\sum \sqrt{p_i}\, \xi_i = 0$.

We now consider the contribution of the other terms in (3.6) as modified. Again, they are
uniformly bounded by a power of N and the total measure of any region where $|\lambda_i - N p_i| > \varepsilon N$
for some i and some $\varepsilon > 0$ tends exponentially to zero as $N \to \infty$. Thus in determining the
asymptotics of the other terms we may assume that $\lambda_i \sim N p_i$ for all i.

The constant $\Delta'(p)$ is given by (3.4). As for $\Delta(h)$, observe that the factor

\[ h_i - h_j = \lambda_i - \lambda_j - i + j \]

in the product in (3.1) is asymptotically equal to $N(p_i - p_j)$ when i and j do not belong to
the same $A_\alpha$ and to $\sqrt{N p_\alpha}\, (\xi_i - \xi_j)$ if $i, j \in A_\alpha$. It follows that

\[ \Delta(h) \sim \prod_{\alpha < \beta} \bigl(N (p_\alpha - p_\beta)\bigr)^{k_\alpha k_\beta} \prod_{\alpha} (N p_\alpha)^{k_\alpha(k_\alpha - 1)/4}\, \Delta_\alpha(\xi) \]

where $\Delta_\alpha(\xi)$ is the Vandermonde determinant of those $\xi_i$ with $i \in A_\alpha$.

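The next factor in (3.6), the reciprocal of the double product, is asymptotically

\[ \prod_{i=1}^{k-1} (N p_i)^{\,i - k}. \]

As for the sum in (3.6) as modified, observe that since each $\sigma$ now belongs to $S'_k$ each product
$\prod_{i \in A_\alpha} p_i^{\,k - \sigma(i)}$ appearing there is equal to $\prod_{i \in A_\alpha} p_\alpha^{\,k - i}$. Each such product is to be multiplied by

\[ \Bigl[\prod_{i \in A_\alpha} h_{\sigma(i)}^{\,i_\alpha - i}\Bigr]\, p_\alpha^{-k_\alpha(k_\alpha - 1)/2}. \]

(See (3.5).) Hence the sum itself, apart from the common factor $M_p(\lambda)$, is equal to

\[ \prod_{\alpha} p_\alpha^{\,k_\alpha (k - i_\alpha)} \sum_{\sigma \in S'_k} (-1)^{\sigma} \prod_{\alpha} \prod_{i \in A_\alpha} h_{\sigma(i)}^{\,i_\alpha - i}. \]

Since each $\sigma \in S'_k$ is uniquely expressible as a product of $\sigma_\alpha \in S(A_\alpha)$ (where $S(A_\alpha)$ is the
group of permutations of $A_\alpha$) we have

\[ \sum_{\sigma \in S'_k} (-1)^{\sigma} \prod_{\alpha} \prod_{i \in A_\alpha} h_{\sigma(i)}^{\,i_\alpha - i}
   = \prod_{\alpha} \sum_{\sigma_\alpha \in S(A_\alpha)} (-1)^{\sigma_\alpha} \prod_{i \in A_\alpha} h_{\sigma_\alpha(i)}^{\,i_\alpha - i}. \]

Each inner sum is the Vandermonde determinant of the $h_i$ with $i \in A_\alpha$, which by the estimate
for $h_i - h_j$ above is asymptotically $(N p_\alpha)^{k_\alpha(k_\alpha - 1)/4}\, \Delta_\alpha(\xi)$.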

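Putting all this together shows that the limiting distribution is

\[ (2\pi)^{-(k-1)/2} \prod_{\alpha} \bigl(1!\, 2! \cdots (k_\alpha - 1)!\bigr)^{-1} \prod_{\alpha} \Delta_\alpha(\xi)^2\; e^{-\sum \xi_i^2/2}\; \delta\Bigl(\sum \sqrt{p_i}\, \xi_i\Bigr). \tag{3.8} \]

This has a random matrix interpretation. It is the distribution function for the eigenvalues
in the direct sum of mutually independent $k_\alpha \times k_\alpha$ Gaussian unitary ensembles, conditional
on the eigenvalues $\xi_i$ satisfying $\sum \sqrt{p_i}\, \xi_i = 0$.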

It remains to determine the support of the limiting distribution. In terms of the $\xi_i$ the
inequalities $\lambda_{i+1} \le \lambda_i$ are equivalent to

\[ \xi_{i+1} \le \frac{N (p_i - p_{i+1})}{\sqrt{N p_{i+1}}} + \sqrt{\frac{p_i}{p_{i+1}}}\; \xi_i. \]

In the limit $N \to \infty$ this becomes no restriction if $p_{i+1} < p_i$ but becomes $\xi_{i+1} \le \xi_i$ if $p_{i+1} = p_i$.
Otherwise said, the support of the limiting distribution is restricted to those $\{\xi_i\}$ for which
$\xi_{i+1} \le \xi_i$ whenever i and i + 1 belong to the same $A_\alpha$. (In the random matrix interpretation
it means that the eigenvalues within each GUE are ordered.) We denote this set of $\xi_i$ by H.

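It now follows from (2.2) and (3.8) (also recall the ordering (3.3)) that

\[ \lim_{N \to \infty} \mathrm{Prob}\Bigl(\frac{\ell_N - N p_1}{\sqrt{N p_1}} \le s\Bigr)
   = (2\pi)^{-(k-1)/2} \prod_{\alpha} \bigl(1!\, 2! \cdots (k_\alpha - 1)!\bigr)^{-1}
     \times \int \cdots \int_{\substack{\xi \in H \\ \xi_1 \le s}} \prod_{\alpha} \Delta_\alpha(\xi)^2\; e^{-\sum \xi_i^2/2}\; \delta\Bigl(\sum \sqrt{p_i}\, \xi_i\Bigr)\, d\xi_1 \cdots d\xi_k. \tag{3.9} \]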

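When the probabilities are not all equal this may be reduced to a $k_1$-dimensional integral as
follows. Let i denote the indices in $A_1$ and j the other indices. We have to integrate

\[ \prod_{\alpha} \Delta_\alpha(\xi)^2\; e^{-\sum \xi^2/2}\; \delta\Bigl(\sqrt{p_1} \sum_i \xi_i + \sum_j \sqrt{p_j}\, \xi_j\Bigr) \]

over the subset of H where $\xi_1 \le s$. Since $\xi_1 = \max \xi_i$ and since the integrand is symmetric
in the $\xi_i$ and the $\xi_j$ within their groups we may (by changing the normalization constant)
integrate over all $\xi_i \le s$ and all $\xi_j$. We first fix the $\xi_i$ and integrate over the $\xi_j$. These have
to satisfy

\[ \sum_j \sqrt{p_j}\, \xi_j = -\sqrt{p_1} \sum_i \xi_i. \]

If we write

\[ \xi_j = \eta_j + x \sqrt{p_j} \tag{3.10} \]

where $\{\eta_j\}$ is orthogonal to $\{\sqrt{p_j}\}$ then

\[ x = -\frac{\sqrt{p_1}}{1 - k_1 p_1} \sum_i \xi_i. \]

(Recall that $A_1$ has $k_1$ indices.) For each $\alpha > 1$ we have $\Delta_\alpha(\xi) = \Delta_\alpha(\eta)$ since the $p_j$ within
groups are equal and

\[ \sum_j \xi_j^2 = \sum_j \eta_j^2 + x^2 \sum_j p_j = \sum_j \eta_j^2 + \frac{p_1}{1 - k_1 p_1} \Bigl(\sum_i \xi_i\Bigr)^2. \]

So the distribution function is equal to a constant times

\[ \int_{-\infty}^{s} \cdots \int_{-\infty}^{s} \int \Delta_1(\xi)^2 \prod_{\alpha > 1} \Delta_\alpha(\eta)^2\;
   e^{-\frac12 \sum_i \xi_i^2 - \frac12 \sum_j \eta_j^2 - \frac{p_1}{2(1 - k_1 p_1)} (\sum_i \xi_i)^2}\; d\eta\; d\xi_1 \cdots d\xi_{k_1} \]

where the $\eta$ integration is over the orthogonal complement of $\{\sqrt{p_j}\}$. The $\eta$ integral is just
another constant. Therefore the distribution function equals

\[ \frac{1}{c_{k_1, p_1}} \int_{-\infty}^{s} \cdots \int_{-\infty}^{s} \Delta(\xi)^2\;
   e^{-\frac12 \left[\sum \xi_i^2 + \frac{p_1}{1 - k_1 p_1} (\sum \xi_i)^2\right]}\; d\xi_1 \cdots d\xi_{k_1} \]

where $c_{k_1, p_1}$ is the integral over all of $R^{k_1}$.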
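As a consistency check, here is a worked special case that is not in the original text. Take
$k_1 = 1$, so that one letter is strictly more probable than every other. Then $\Delta(\xi) = 1$, there is a
single integration variable $\xi$, and the exponent $-\frac12[\xi^2 + \frac{p_1}{1-p_1}\xi^2]$ simplifies because
$1 + p_1/(1 - p_1) = 1/(1 - p_1)$:

```latex
\[
\frac{1}{c_{1,p_1}} \int_{-\infty}^{s}
   \exp\Bigl(-\tfrac12\Bigl[\xi^2 + \frac{p_1}{1-p_1}\,\xi^2\Bigr]\Bigr)\, d\xi
 = \frac{1}{\sqrt{2\pi(1-p_1)}} \int_{-\infty}^{s} e^{-\xi^2/(2(1-p_1))}\, d\xi ,
\qquad
c_{1,p_1} = \int_{-\infty}^{\infty} e^{-\xi^2/(2(1-p_1))}\, d\xi = \sqrt{2\pi(1-p_1)} .
\]
```

So in this case $(\ell_N - N p_1)/\sqrt{N p_1}$ is asymptotically normal with mean 0 and variance $1 - p_1$;
equivalently $(\ell_N - N p_1)/\sqrt{N}$ has limiting variance $p_1(1 - p_1)$, as stated in the Introduction.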

To evaluate this we make the substitution (3.10), but with j replaced by i and each pj replaced
by The integral becomes 



dx, 



taken over $x \in R$ and $\eta$ in the hyperplane $\sum \eta_i = 0$ with Lebesgue measure. The x integral
equals $\sqrt{2\pi k_1 (1 - k_1 p_1)}$ while the first integral equals $(2\pi)^{(k_1 - 1)/2}\, 1!\, 2! \cdots k_1!$. (For the last,
observe that the right side of (3.9) must equal 1 when $s = \infty$.) Hence



Ck,,p, = (27r)'=i/2 1! 2! . . . J h{l - km) 



3.1 Distinct probabilities — the next approximation 

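If all the $p_i$ are different then $P(\lambda) := \mathrm{Prob}(\lambda)$ equals

\[ \frac{\Delta(h)}{\Delta(p)}\; \frac{1}{\prod_{i=1}^{k-1} \prod_{j=i}^{k-1} (\lambda_i + k - j)}\; \prod_{i=1}^{k} p_i^{\,k - i}\; M_p(\lambda) \tag{3.11} \]

plus an exponentially small correction. We recall that

\[ \lambda_j = N p_j + \sqrt{N p_j}\, \xi_j \]

and compute the Fourier transform of the measure P with respect to the $\xi$ variables. Beginning
with $M_p$, we have

\[ \hat{M}_p(x) = \int e^{i x \cdot \xi}\, dM_p(\lambda)
   = e^{-i \sqrt{N} \sum \sqrt{p_j}\, x_j} \int e^{i \sum \lambda_j x_j / \sqrt{N p_j}}\, dM_p(\lambda)
   = e^{-i \sqrt{N} \sum \sqrt{p_j}\, x_j} \Bigl(\sum_j p_j\, e^{i x_j / \sqrt{N p_j}}\Bigr)^{N} \]

since $M_p$ is the multinomial distribution. An easy computation gives

\[ \hat{M}_p(x) = \Bigl(1 + \frac{1}{\sqrt{N}}\, Q(x) + O\bigl(\tfrac{1}{N}\bigr)\Bigr)\, e^{-\frac12 \sum x_j^2 + \frac12 \left(\sum \sqrt{p_j}\, x_j\right)^2} \]

where Q(x) is a homogeneous polynomial of degree three. (In particular the limit of $\hat{M}_p$ is
the inverse Fourier transform of the exponential in the above formula, which equals (3.7).)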

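As for the other nonconstant factors in (3.11), we have

\[ \prod_{i=1}^{k-1} \prod_{j=i}^{k-1} (\lambda_i + k - j)
   = \prod_{i=1}^{k-1} \bigl(N p_i + \sqrt{N p_i}\, \xi_i + O(1)\bigr)^{k-i}
   = \prod_{i=1}^{k-1} (N p_i)^{k-i}\; \Bigl(1 + \frac{1}{\sqrt{N}} \sum_{i=1}^{k-1} (k - i) \frac{\xi_i}{\sqrt{p_i}} + O\bigl(\tfrac{1}{N}\bigr)\Bigr) \]

and

\[ \Delta(h) = \prod_{i<j} \Bigl[N (p_i - p_j) + \sqrt{N} \bigl(\sqrt{p_i}\, \xi_i - \sqrt{p_j}\, \xi_j\bigr) + O(1)\Bigr]
   = \prod_{i<j} N (p_i - p_j)\; \Bigl(1 + \frac{1}{\sqrt{N}} \sum_{i<j} \frac{\sqrt{p_i}\, \xi_i - \sqrt{p_j}\, \xi_j}{p_i - p_j} + O\bigl(\tfrac{1}{N}\bigr)\Bigr). \]

Thus the factors in (3.11) aside from $M_p$ contribute

\[ 1 + \frac{1}{\sqrt{N}} \Bigl(\sum_{i<j} \frac{\sqrt{p_i}\, \xi_i - \sqrt{p_j}\, \xi_j}{p_i - p_j} - \sum_{i} (k - i) \frac{\xi_i}{\sqrt{p_i}}\Bigr) + O\bigl(\tfrac{1}{N}\bigr)
   = 1 + \frac{1}{\sqrt{N}} \sum_{i<j} \frac{(p_j / \sqrt{p_i})\, \xi_i - \sqrt{p_j}\, \xi_j}{p_i - p_j} + O\bigl(\tfrac{1}{N}\bigr). \]

Using the fact that multiplication by $\xi_j$ corresponds, after taking Fourier transforms, to $-i \partial_{x_j}$
and combining this with the preceding we deduce that $\hat{P}(x)$, the Fourier transform of $P(\lambda)$
with respect to the $\xi$ variables, equals

\[ \Bigl[1 + \frac{1}{\sqrt{N}} \Bigl(Q(x) - i \sum_{i<j} \frac{(p_j / \sqrt{p_i})\, \partial_{x_i} - \sqrt{p_j}\, \partial_{x_j}}{p_i - p_j}\Bigr) + O\bigl(\tfrac{1}{N}\bigr)\Bigr]\,
   e^{-\frac12 \sum x_j^2 + \frac12 \left(\sum \sqrt{p_j}\, x_j\right)^2} \]

plus a correction which is exponentially small in N.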



3.1.1 The mean 



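We have

\[ E(\xi_1) = \int \xi_1\, dP(\lambda) = -i\, \partial_{x_1} \hat{P}(x) \Big|_{x=0}. \]

From the above we see that this equals

\[ \frac{1}{\sqrt{N p_1}} \sum_{j>1} \frac{p_j}{p_1 - p_j} + O\bigl(\tfrac{1}{N}\bigr). \]

Hence

\[ E(\ell_N) = E(\lambda_1) = N p_1 + \sum_{j>1} \frac{p_j}{p_1 - p_j} + O\Bigl(\frac{1}{\sqrt{N}}\Bigr), \qquad N \to \infty. \tag{3.12} \]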



This last formula is, in fact, an accurate approximation for $E(\ell_N)$ (for distinct $p_i$) for moderate
values of N. Table 1 summarizes various simulations of $\ell_N$ and compares the means of
these simulated values with the asymptotic formula. We remark that even though the proof
assumed distinct $p_i$, we expect the asymptotic formula to remain valid for $p_1 > p_2 \ge \cdots \ge p_k$.
(See the last set of simulations in Table 1.)
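The asymptotic value in (3.12) is easy to evaluate directly. The helper below is a minimal
sketch, with the hypothetical name `asymptotic_mean`; it drops the $O(1/\sqrt{N})$ error term and
requires the largest probability to be attained by exactly one letter, since otherwise a
denominator $p_1 - p_j$ vanishes.

```python
def asymptotic_mean(p, N):
    """N*p_1 + sum over j > 1 of p_j / (p_1 - p_j), i.e. (3.12) without the error term."""
    ordered = sorted(p, reverse=True)
    p1, rest = ordered[0], ordered[1:]
    if rest and rest[0] == p1:
        raise ValueError("the largest probability must belong to exactly one letter")
    return N * p1 + sum(pj / (p1 - pj) for pj in rest)
```

For the probabilities {5/7, 2/7} and N = 50 this gives 35.714... + 0.666... = 36.38 to two
decimals, the value listed in the last column of Table 1.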
3.1.2 The variance

Let us write our approximation as $P = P_0 + N^{-1/2} P_1 + O(N^{-1})$ with corresponding expected
values $E = E_0 + N^{-1/2} E_1 + O(N^{-1})$. (In fact $P_1$ is a distribution, not a measure, but the
meaning is clear.) Then the variance of $\lambda_1$ is equal to

\[ N p_1 \Bigl[ E_0(\xi_1^2) - E_0(\xi_1)^2 + \frac{1}{\sqrt{N}} \bigl(E_1(\xi_1^2) - 2 E_0(\xi_1) E_1(\xi_1)\bigr) + O\bigl(\tfrac{1}{N}\bigr) \Bigr]. \]

Of course $E_0(\xi_1) = 0$, but also

\[ E_1(\xi_1^2) = -\partial_{x_1, x_1} \hat{P}_1(x) \Big|_{x=0} = 0. \]

Since

\[ E_0(\xi_1^2) - E_0(\xi_1)^2 = 1 - p_1 \]

we find that the variance of $\lambda_1$ equals $N p_1 (1 - p_1) + O(1)$ and so its standard deviation equals
$\sqrt{N p_1 (1 - p_1)} + O(N^{-1/2})$.
k   Probabilities of {1,...,k}   N      N_s      Mean     Asymptotic

2   {5/7, 2/7}                   50     20 000   36.37    36.38
                                 100    20 000   72.12    72.10
                                 500    20 000   357.73   357.81
2   {6/11, 5/11}                 50     20 000   30.54    32.27
                                 100    20 000   58.52    59.55
                                 200    20 000   113.71   114.09
                                 400    20 000   223.16   223.18
3   {1/2, 5/14, 1/7}             50     10 000   27.53    27.90
                                 100    10 000   52.79    52.90
                                 500    10 000   252.80   252.90
                                 1000   10 000   502.78   502.90
3   {3/8, 1/3, 7/24}             50     10 000   23.96    30.25
                                 100    10 000   44.33    49.00
                                 500    10 000   197.65   199.00
                                 1000   2 000    386.08   386.50
3   {3/8, 5/16, 5/16}            50     10 000   23.92    28.75
                                 100    10 000   44.16    47.50
                                 200    10 000   83.15    85.00
                                 400    10 000   159.30   160.00
                                 800    10 000   310.08   310.00

Table 1: Simulations of the length of the longest weakly increasing subsequence in inhomogeneous
random words of length N for two- and three-letter alphabets. N_s is the sample
size. The last column gives the asymptotic expected value (3.12).
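A simulation of the kind summarized in Table 1 can be sketched as follows. This is not the
authors' simulation code, which the text does not describe; the function names, the use of
Python's `random` module, and the seed are all assumptions, and results vary from run to run
within the usual Monte Carlo error.

```python
import random
from bisect import bisect_right


def longest_weakly_increasing(word):
    """Length of the longest weakly increasing subsequence (patience sorting)."""
    tails = []
    for letter in word:
        pos = bisect_right(tails, letter)
        if pos == len(tails):
            tails.append(letter)
        else:
            tails[pos] = letter
    return len(tails)


def simulated_mean(p, N, samples, seed=0):
    """Average of l_N over `samples` independent inhomogeneous random words."""
    rng = random.Random(seed)
    letters = range(1, len(p) + 1)
    total = 0
    for _ in range(samples):
        # each letter is drawn independently, letter i with probability p[i - 1]
        word = rng.choices(letters, weights=p, k=N)
        total += longest_weakly_increasing(word)
    return total / samples
```

With p = {5/7, 2/7} and N = 50 the standard deviation of $\ell_N$ is roughly
$\sqrt{N p_1 (1 - p_1)} \approx 3.2$, so a sample of a few thousand words determines the mean to
within about 0.1, close enough to compare with the first row of the table.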